added snakemake-like Rule #638

Open · wants to merge 2 commits into master

Conversation

jonathanBieler

This adds a way to declare rules that bind input and output files and rerun them if the files are outdated or missing, a bit like in Snakemake (although quite basic). I've toyed with more advanced ways of (re-)triggering the computation, but I think that would require much more complicated machinery (a CLI, caching of runs, hashing of the code, etc.), so here I just added a way to manually retrigger a computation with a keyword.

If there's interest in this I can add docs.

cf. https://discourse.julialang.org/t/dagger-dates-snakemake/126111

Example:

using CSV, DataFrames, Dagger, Statistics

## prepare data

dir = mktempdir()

mean_squared_input = Float64[]
for sample_idx in 1:5
   x = rand(10)
   CSV.write("$(dir)/sample_$(sample_idx).csv", DataFrame(x=x))
   push!(mean_squared_input, mean(x.^2))
end

samples = ["$(dir)/sample_$(sample_idx).csv" for sample_idx in 1:5]

## define rules 

# function that creates a Rule for a given sample
get_rule_square(sample) = Dagger.Rule(sample => replace(sample, "sample_" => "sample_squared_"); forcerun=false) do input, output
   df = CSV.read(input[1], DataFrame)
   df.xsquared = df.x .^ 2
   CSV.write(output[1], df)
   output
end

squared_rules = get_rule_square.(samples)
squared_rule_outputs = [only(r.outputs) for r in squared_rules]

make_summary = Dagger.Rule(squared_rule_outputs => "$(dir)/samples_summary.csv"; forcerun=false) do inputs, output
   dfs = CSV.read.(inputs, DataFrame)
   mean_squared = DataFrame(sample = inputs, mean_squared = [mean(df.xsquared) for df in dfs])
   CSV.write(output[1], mean_squared)
   output
end

## Run 

squared = [Dagger.@spawn r() for r in squared_rules]
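# passing the spawned per-sample tasks as arguments makes the summary rule run after them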
summary_file = Dagger.@spawn make_summary(squared...)

out = CSV.read(fetch(summary_file), DataFrame)

@assert out.mean_squared == mean_squared_input

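# touching one of the summary rule's input files after the summary was written marks it stale again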
@assert Dagger.needs_update(make_summary) == false
run(`touch $(squared_rule_outputs[1])`)
sleep(0.5)
@assert Dagger.needs_update(make_summary) == true

run(`rm $(squared_rule_outputs[1])`)
summary_file = Dagger.@spawn make_summary(squared...) # fails

squared = [Dagger.@spawn r() for r in squared_rules] # redo only 1 file

summary_file = Dagger.@spawn make_summary(squared...)  
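
A minimal sketch of the manual retrigger via the keyword mentioned above (assuming forcerun=true forces the body to run even when the output is up to date; forced_rule is just an illustrative name):

## force a rerun of the first sample's rule even though its output already exists
forced_rule = Dagger.Rule(samples[1] => replace(samples[1], "sample_" => "sample_squared_"); forcerun=true) do input, output
   df = CSV.read(input[1], DataFrame)
   df.xsquared = df.x .^ 2
   CSV.write(output[1], df)
   output
end
fetch(Dagger.@spawn forced_rule())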

@jpsamaroo
Member

Very cool! Right now this API feels a bit cumbersome to me, so I'd like us to think on ways to make things feel "smoother" and more automatic (not that I really know what that means right now). Something that would be really nice is if this would integrate with Dagger.File and Dagger.tofile, which are used for lazy-loading and saving of files, respectively. Again, not sure what that would look like, but I'm open to ideas 😄
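
For reference, this is roughly how Dagger.File and Dagger.tofile are used today (a sketch from memory of the Dagger docs, so treat the exact signatures as approximate):

using Dagger

path = joinpath(mktempdir(), "data.jls")

# write a value to disk and get back a lazy file reference
file = Dagger.tofile([1, 2, 3], path)

# the data is only loaded where a task actually uses it
@assert fetch(Dagger.@spawn sum(file)) == 6

# wrap an already-existing file for lazy loading
@assert fetch(Dagger.@spawn sum(Dagger.File(path))) == 6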

@jonathanBieler
Author

Well, that's the API right now:

Dagger.Rule(
    user_function,
    input_path => output_path
)

I'm not sure it can be much simpler. Maybe if input_path/output_path aren't defined it could default to some tofile-based caching? Often you start with raw data and end up with some "publishable" outputs (plots, a report, ...), but you might not care much about the intermediate steps, so it would be nicer if they could be managed automatically.

But I agree the whole thing is a bit cumbersome. One issue is that in Snakemake you define the rules using files as inputs/outputs and Snakemake then builds and executes the graph for you. Here you have to define the rules and still build the graph manually, spawning things in the right order with the right arguments.

Another issue is that if you modify the user_function the rule won't rerun, since only the input/output dates are checked and not the code, so you can get wrong results if you're not careful.
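
One manual workaround (just a sketch reusing the Rule call from the example above, not something this PR automates): bake a version tag into the output paths and bump it whenever the function changes, so the old outputs are treated as missing.

RULE_VERSION = "v2"  # bump this whenever the rule body changes

get_rule_square(sample) = Dagger.Rule(sample => replace(sample, "sample_" => "sample_squared_$(RULE_VERSION)_"); forcerun=false) do input, output
   df = CSV.read(input[1], DataFrame)
   df.xsquared = df.x .^ 2
   CSV.write(output[1], df)
   output
end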

My intuition is that you either have to go with a bare-bones design (what I've tried to do) and let the user do most of the work, or go all in with caching, a CLI, etc. like Snakemake/Nextflow (which would be a separate package, I think), and that anything in between is a bit awkward.
